remove DetachedEagleGPT model and handle all offline mode in the _DynamicEagleGPTModel #321
Walkthrough
Removed the OfflineEagleDMRegistry and detached wrappers. Unified registration and conversion through EagleDMRegistry. Updated conversion to dynamically map subclasses and pass the expanded Eagle config. In the Megatron plugin, added an explicit offline (eagle_offline) forward path, shape/indexing adjustments, and disabled sequence_parallel in offline mode. Removed the detached HF transformers wrapper.
Sequence Diagram(s)
sequenceDiagram
autonumber
actor User
participant App
participant Conversion as conversion.py
participant Registry as EagleDMRegistry
participant EagleModel
User->>App: convert_to_eagle_model(model, config)
App->>Conversion: convert_to_eagle_model(model, config)
alt Type registered
Conversion->>Registry: lookup type(model)
else Type not registered
Conversion->>Registry: iterate _registry for subclass match
Note over Conversion,Registry: On match, map original class to base via register()
end
Conversion->>Registry: convert(model)
Registry-->>Conversion: eagle_model
Conversion->>EagleModel: modify(config: offline, hidden_state_distillation, self_logit_distillation, freeze_base_model, report_acc, reuse_base_decoder, loss_decay_factor, architecture_config)
EagleModel-->>App: modified eagle_model
sequenceDiagram
autonumber
actor Trainer
participant Model as MegatronEagleModel
participant Base as BaseModel
participant Eagle as EagleModule
Trainer->>Model: forward(input_ids, labels, kwargs)
alt Online
Model->>Base: _base_model_forward(...)
Base-->>Model: hidden_states
opt return_eagle_inputs
Model->>Model: _get_eagle_input_hidden_states(hidden_states, apply_fc=false)
Model-->>Trainer: {input_ids, aux_hidden_states, hidden_states}
end
Model->>Model: _get_eagle_input_hidden_states(hidden_states, apply_fc=true)
else Offline
Note over Model: sequence_parallel disabled
Model->>Model: use aux_hidden_states from kwargs
Model->>Model: _get_eagle_input_hidden_states(aux_hidden_states, apply_fc=true)
Note over Model: If labels len = input_ids-1, pad labels
end
Model->>Eagle: compute logits (multi-step)
Note over Model,Eagle: Shape-based slicing for eagle_logits_N in offline
Eagle-->>Trainer: outputs (logits, losses, etc.)
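A rough sketch of the gating the second diagram describes (names such as eagle_offline, _base_model_forward, and _get_eagle_input_hidden_states are taken from the review below; the body is an illustrative assumption, not the actual implementation):

def forward_gating_sketch(model, input_ids, labels=None, **kwargs):
    """Hypothetical outline of the unified online/offline EAGLE forward path."""
    if model.eagle_offline:
        # Offline: precomputed base-model states arrive via kwargs; the base forward is
        # skipped and sequence_parallel has been disabled at setup time.
        hidden_states = kwargs["hidden_states"]              # [s, b, h]
        aux_hidden_states = kwargs.get("aux_hidden_states")  # [s, b, k*h] for EAGLE-3
        eagle_in = model._get_eagle_input_hidden_states(aux_hidden_states, apply_fc=True)
    else:
        hidden_states = model._base_model_forward(input_ids, **kwargs)
        eagle_in = model._get_eagle_input_hidden_states(hidden_states, apply_fc=True)
    # Multi-step draft logits are then computed from eagle_in; offline runs trim one extra
    # position ([:-2] vs. [:-1]) when slicing them against labels.
    return eagle_in, hidden_states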
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/speculative/plugins/megatron_eagle.py (1)
1-1
: Replace or remove OfflineEagleDMRegistry usages in transformers.py
Found offline-registry references: import at modelopt/torch/speculative/plugins/transformers.py:49 and decorator usage @OfflineEagleDMRegistry.register at line 1144. No matches for DetachedEagle|DetachedEagleGPT or for return_eagle_inputs=True were found. Replace these uses with the supported registry API (e.g., EagleDMRegistry) or remove/update them to match the intended removal of the offline registry.
🧹 Nitpick comments (8)
modelopt/torch/speculative/eagle/conversion.py (1)
35-39
: Avoid relying on private registry internals; register with a stable key and fail fast when unmapped.
- Using EagleDMRegistry._registry directly is a private detail and brittle.
- The key "base_model_class" is opaque; prefer a stable, informative key (e.g., fully-qualified class name).
- If no base class match is found, EagleDMRegistry.convert will raise a KeyError later with less context. Raise a clear error when the mapping cannot be inferred.
Apply this diff to make the registration more robust and explicit:
- if original_cls not in EagleDMRegistry:
-     for cls in EagleDMRegistry._registry:
-         if issubclass(original_cls, cls):
-             EagleDMRegistry.register({original_cls: "base_model_class"})(EagleDMRegistry[cls])
-             break
+ if original_cls not in EagleDMRegistry:
+     # Register a mapping for subclasses to the same dynamic class as their base.
+     found = False
+     for base_cls in list(EagleDMRegistry._registry.keys()):
+         if issubclass(original_cls, base_cls):
+             key = f"{original_cls.__module__}.{original_cls.__name__}"
+             EagleDMRegistry.register({original_cls: key})(EagleDMRegistry[base_cls])
+             found = True
+             break
+     if not found:
+         raise KeyError(
+             f"No Eagle dynamic mapping for {original_cls.__module__}.{original_cls.__name__}. "
+             f"Ensure a compatible base class is registered in EagleDMRegistry."
+         )
modelopt/torch/speculative/plugins/megatron_eagle.py (7)
748-751
: Disable sequence_parallel in offline mode — add a guard log.
For clarity during debugging, emit a one-time info when flipping sequence_parallel to False in offline runs so users aren’t surprised by the override.
Apply this diff:
  # sequence_parallel is not used in offline eagle
  if self.eagle_offline:
-     self.config.sequence_parallel = False
+     if self.config.sequence_parallel:
+         warnings.warn("EAGLE offline: forcing sequence_parallel = False")
+     self.config.sequence_parallel = False
768-768
: Update comment to reflect “offline” (detached class no longer exists).
Replace “detached eagle” with “offline eagle” to avoid confusion.
- # layer ids are not used in detached eagle, but we need to set this to have correct fc_input_size_multiplier
+ # Layer IDs are not used in offline eagle, but we set them to get the correct fc_input_size_multiplier
853-866
: Offline path: assert input shape assumptions and document expectations.
In offline mode, _get_eagle_input_hidden_states expects aux_hidden_states to already be concatenated ([s, b, k*h]). Add explicit asserts to catch silent shape mismatches.
  def _get_eagle_input_hidden_states(self, hidden_states: torch.Tensor, apply_fc: bool = True):
      """When _aux_hidden_states is not empty, then this is EAGLE-3.
@@
-     if not self.eagle_offline:
+     if not self.eagle_offline:
          if len(self._aux_hidden_states) == 0:
              return hidden_states
@@
-     if apply_fc:
+     if apply_fc:
+         # In offline mode, hidden_states may already be [s, b, k*h]; validate k.
+         if self.eagle_offline:
+             h = self.config.hidden_size
+             assert hidden_states.shape[-1] % h == 0, (
+                 f"Expected aux_hidden_states hidden dim to be a multiple of {h}, "
+                 f"got {hidden_states.shape[-1]}"
+             )
          # [s / TP, b, 3h] -> [s / TP, b, h]
          return self.eagle_module.fc(hidden_states)[0]
1076-1086
: Make loss alignment explicit and robust across online/offline lengths.
Inferring offline via labels.shape[1] < eagle_logits.shape[0] is brittle. Align by checking the delta and fail fast on unexpected shapes.
  if self.eagle_self_logit_distillation:
      mapping = self.eagle_module.d2t if hasattr(self.eagle_module, "d2t") else None
      token_loss = self.kld(eagle_logits[:-1, :, :], logits[1:, :, :], mapping)
- elif labels.shape[1] < eagle_logits.shape[0]:
-     token_loss = self.compute_language_model_loss(labels[:, 1:], eagle_logits[:-2, :, :])
- else:
-     token_loss = self.compute_language_model_loss(labels[:, 1:], eagle_logits[:-1, :, :])
+ else:
+     target_len = labels.shape[1] - 1
+     pred_len = eagle_logits.shape[0]
+     if pred_len - 1 == target_len:
+         aligned = eagle_logits[:-1, :, :]
+     elif pred_len - 2 == target_len:
+         aligned = eagle_logits[:-2, :, :]
+     else:
+         raise ValueError(
+             f"Unexpected lengths for EAGLE loss: labels={labels.shape}, "
+             f"eagle_logits={eagle_logits.shape}"
+         )
+     token_loss = self.compute_language_model_loss(labels[:, 1:], aligned)
1298-1301
: Consolidate accuracy-slice logic to avoid off‑by‑one drift.
Four places differ only by the starting offset and offline end trim (-1 vs -2). Centralize this to a small helper to prevent future inconsistencies.
Add this helper (outside forward):
def _slice_for_acc(self, logits_1d: torch.Tensor, start: int) -> torch.Tensor:
    end = -2 if self.eagle_offline else -1
    return logits_1d[start:end, :, :]

Then replace:
- eagle_logits_0[:-2 or :-1] -> self._slice_for_acc(eagle_logits_0, 0)
- eagle_logits_1[1:-2 or 1:-1] -> self._slice_for_acc(eagle_logits_1, 1)
- eagle_logits_2[2:-2 or 2:-1] -> self._slice_for_acc(eagle_logits_2, 2)
- eagle_logits_3[3:-2 or 3:-1] -> self._slice_for_acc(eagle_logits_3, 3)
Also applies to: 1337-1343, 1379-1385, 1421-1426
1245-1249
: Avoid eager .cpu() copies on hot paths.
If return_eagle_inputs is used frequently, consider non_blocking transfers to CPU to minimize stalls.
- return {
-     "input_ids": input_ids.squeeze(0).cpu(),
-     "aux_hidden_states": eagle_module_input_hidden_states.squeeze(1).cpu(),
-     "hidden_states": hidden_states.squeeze(1).cpu(),
- }
+ return {
+     "input_ids": input_ids.squeeze(0).to("cpu", non_blocking=True),
+     "aux_hidden_states": eagle_module_input_hidden_states.squeeze(1).to(
+         "cpu", non_blocking=True
+     ),
+     "hidden_states": hidden_states.squeeze(1).to("cpu", non_blocking=True),
+ }
1461-1473
: Minor: clear hooks along early returns.
You clear self._aux_hidden_states on the early return with no children. If forward hooks were registered conditionally, also consider guarding to avoid accumulating state across calls.
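A defensive variant of that early return might look like this (a sketch; attribute names follow the surrounding review comments, not the exact source):

# Hypothetical guard: clear hook-accumulated aux states on every early return so a
# later forward call cannot pick up stale activations, however the hooks were registered.
if not self.post_process:
    self._aux_hidden_states.clear()
    return hidden_states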
🔇 Additional comments (5)
modelopt/torch/speculative/eagle/conversion.py (1)
41-41
: LGTM: unified conversion path through EagleDMRegistry.
Directly converting via EagleDMRegistry.convert(model) is consistent with the registry consolidation effort.
modelopt/torch/speculative/plugins/megatron_eagle.py (4)
55-55
: LGTM: remove OfflineEagleDMRegistry import.
Import surface now matches the consolidated registry usage.
818-820
: LGTM: propagate draft_vocab_size and has_lm_head into reused config.
Keeps EAGLE module output head consistent in the reuse path.
1323-1331
: LGTM: shape-based splitting of multi-step logits.
Indexing with logits_sbh.shape[0] makes the slicing resilient to seq-length variations.
Also applies to: 1364-1373, 1406-1415
1623-1653
: LGTM: offline/online flows preserved in pseudo_speculative_generate.
No issues spotted; consistent with the new offline gating and SP handling.
Actionable comments posted: 1
♻️ Duplicate comments (1)
modelopt/torch/speculative/plugins/megatron_eagle.py (1)
1196-1211
: Addressed prior review: offline forward now validates inputs and forbids return_eagle_inputs.
This resolves the earlier “NoneType later” hazard and ambiguous semantics in offline mode.
🧹 Nitpick comments (4)
modelopt/torch/speculative/plugins/megatron_eagle.py (4)
748-751
: Disable sequence_parallel in offline: good; add an explicit user-facing warning when flipping it.
Silent mutation of config can surprise callers. Emit a warning if it was True.
Apply:
  # sequence_parallel is not used in offline eagle
  if self.eagle_offline:
-     self.config.sequence_parallel = False
+     if self.config.sequence_parallel:
+         warnings.warn("EAGLE offline mode: forcibly disabling sequence_parallel.", stacklevel=1)
+     self.config.sequence_parallel = False
767-768
: Update terminology: “detached” → “offline”.
Comment still mentions “detached eagle”; recommend aligning to “offline” to avoid confusion.
- # layer ids are not used in detached eagle, but we need to set this to have correct fc_input_size_multiplier
+ # layer ids are not used in offline EAGLE, but we need to set this to have correct fc_input_size_multiplier
853-866
: Offline/path gate in _get_eagle_input_hidden_states: OK. Consider defensive clearing.
Looks correct. As a safety edge-case when toggling modes within a long-lived object, consider clearing _aux_hidden_states on offline entry to avoid accidental reuse.
  def _get_eagle_input_hidden_states(self, hidden_states: torch.Tensor, apply_fc: bool = True):
@@
-     if not self.eagle_offline:
+     if not self.eagle_offline:
          if len(self._aux_hidden_states) == 0:
              return hidden_states
          # [s / TP, b, len(self._aux_hidden_states) * h]
          hidden_states = torch.cat(self._aux_hidden_states, dim=-1)
          self._aux_hidden_states.clear()
+     else:
+         # Ensure no stale states leak in if mode was toggled.
+         self._aux_hidden_states.clear()
1083-1086
: Tighten condition for offline label/logit alignment.
Use equality instead of less-than to avoid mis-slicing when labels are unexpectedly shorter.
- elif labels.shape[1] < eagle_logits.shape[0]:
+ elif labels.shape[1] + 1 == eagle_logits.shape[0]:
      token_loss = self.compute_language_model_loss(labels[:, 1:], eagle_logits[:-2, :, :])
🔇 Additional comments (5)
modelopt/torch/speculative/plugins/megatron_eagle.py (5)
55-55
: Registry consolidation LGTM.
Removing the OfflineEagleDMRegistry import and relying on EagleDMRegistry is consistent with the PR goal.
818-820
: Propagating draft_vocab_size and has_lm_head when reusing base decoder: LGTM.
Ensures output layer wiring matches reduced vocab configs.
1076-1077
: Docstring update acknowledging offline label shapes: LGTM.
1292-1294
: Shape-based slicing for multi-step chunks: LGTM.
Indexing using logits_sbh.shape[0] is more robust than relying on labels.
Also applies to: 1337-1338, 1379-1379, 1421-1421
1307-1308
: Offline Top‑1 slice offsets: guard for short seq_len.
':-2' (offline) yields zero-length spans for seq_len ≤ 2 (':-1' yields zero for seq_len ≤ 1). Clamp or condition the slice endpoints (e.g., use max(seq_len - 2, 0) or skip/adjust the metric computation) to avoid empty denominators.
Applies to modelopt/torch/speculative/plugins/megatron_eagle.py — lines 1307–1308, 1348–1349, 1390–1391, 1432–1433.
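A small helper along those lines (a sketch under the assumption that the metric can simply be skipped when the trimmed span is empty; the target alignment is inferred from the first accuracy block and should be checked against the actual code):

import torch

def top1_accuracy_or_none(step_logits: torch.Tensor, labels: torch.Tensor,
                          start: int, offline: bool):
    """Return top-1 accuracy for one draft step, or None if the slice would be empty."""
    end = step_logits.shape[0] - (2 if offline else 1)
    if end <= start:  # nothing left to score for very short sequences
        return None
    top1 = step_logits[start:end].transpose(0, 1).argmax(dim=-1)  # [b, n]
    target = labels[:, start + 1 : start + 1 + top1.shape[1]]
    return torch.eq(target, top1).sum() / top1.numel()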
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #321      +/-   ##
==========================================
- Coverage   73.82%   73.81%   -0.01%
==========================================
  Files         172      172
  Lines       17438    17436       -2
==========================================
- Hits        12874    12871       -3
- Misses       4564     4565       +1

☔ View full report in Codecov by Sentry.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
modelopt/torch/speculative/plugins/transformers.py (4)
561-576
: Bug: dtype mismatch (bool vs float) when concatenating attention masks.
torch.cat over attention_mask_0 (float) and zero_mask/mask_2_2 (bool) will error. Keep masks in the same floating dtype and use 0/1 sentinels before masked_fill.
Apply:
- zero_mask = torch.ones_like(attention_mask_0).bool()
+ zero_mask = torch.ones_like(attention_mask_0)
  mask_2_1 = attention_mask_0.clone().detach()
  mask_2_1[:, :, :, :-1] = mask_2_1[:, :, :, 1:]
- mask_2_2 = torch.ones_like(attention_mask_0).bool()
+ mask_2_2 = torch.ones_like(attention_mask_0)
  for i in range(1, seq_length - 1):
-     mask_2_2[:, :, i, i] = False
+     mask_2_2[:, :, i, i] = 0
  cat_attention_mask = torch.cat(
      (
          torch.cat((attention_mask_0, zero_mask), dim=-1),
          torch.cat((mask_2_1, mask_2_2), dim=-1),
      ),
      dim=-2,
  )
  cat_attention_mask = cat_attention_mask.masked_fill(cat_attention_mask == 1, dtypemin)
593-617
: Same dtype bug in the 3-block concat path (second step).
Ensure all masks are float tensors; use 0/1 sentinels consistently.
- zero_mask = torch.ones_like(attention_mask_0).bool()
+ zero_mask = torch.ones_like(attention_mask_0)
  mask_2_1 = attention_mask_0.clone().detach()
  mask_2_1[:, :, :, :-1] = mask_2_1[:, :, :, 1:]
- mask_2_2 = torch.ones_like(attention_mask_0).bool()
+ mask_2_2 = torch.ones_like(attention_mask_0)
  for i in range(1, seq_length - 1):
-     mask_2_2[:, :, i, i] = False
+     mask_2_2[:, :, i, i] = 0
  mask_3_1 = mask_2_1.clone().detach()
  mask_3_1[:, :, :, :-1] = mask_3_1[:, :, :, 1:]
  mask_3_2 = mask_2_2.clone().detach()
  mask_3_2[:, :, :, :-1] = mask_3_2[:, :, :, 1:]
- mask_3_2[:, :, 1, 0] = True
+ mask_3_2[:, :, 1, 0] = 1
  mask_3_3 = mask_2_2.clone().detach()
- mask_3_3[:, :, 1, 1] = True
+ mask_3_3[:, :, 1, 1] = 1
633-671
: Same dtype bug in the 4-block concat path (third step).
Fix bool/float mixing; keep sentinels numeric.
- zero_mask = torch.ones_like(attention_mask_0).bool()
+ zero_mask = torch.ones_like(attention_mask_0)
  mask_2_1 = attention_mask_0.clone().detach()
  mask_2_1[:, :, :, :-1] = mask_2_1[:, :, :, 1:]
- mask_2_2 = torch.ones_like(attention_mask_0).bool()
+ mask_2_2 = torch.ones_like(attention_mask_0)
  for i in range(1, seq_length - 1):
-     mask_2_2[:, :, i, i] = False
+     mask_2_2[:, :, i, i] = 0
  mask_3_1 = mask_2_1.clone().detach()
  mask_3_1[:, :, :, :-1] = mask_3_1[:, :, :, 1:]
  mask_3_2 = mask_2_2.clone().detach()
  mask_3_2[:, :, :, :-1] = mask_3_2[:, :, :, 1:]
- mask_3_2[:, :, 1, 0] = True
+ mask_3_2[:, :, 1, 0] = 1
  mask_3_3 = mask_2_2.clone().detach()
- mask_3_3[:, :, 1, 1] = True
+ mask_3_3[:, :, 1, 1] = 1
  mask_4_1 = mask_3_1.clone().detach()
  mask_4_1[:, :, :, :-1] = mask_4_1[:, :, :, 1:]
  mask_4_2 = mask_3_2.clone().detach()
  mask_4_2[:, :, :, :-1] = mask_4_2[:, :, :, 1:]
- mask_4_2[:, :, 2, 0] = True
+ mask_4_2[:, :, 2, 0] = 1
  mask_4_3 = mask_3_3.clone().detach()
  mask_4_3[:, :, :, :-1] = mask_4_3[:, :, :, 1:]
- mask_4_3[:, :, 2, 1] = True
+ mask_4_3[:, :, 2, 1] = 1
  mask_4_4 = mask_3_3.clone().detach()
- mask_4_4[:, :, 2, 2] = True
+ mask_4_4[:, :, 2, 2] = 1
120-132
: Parameter name mismatch — use cache_position, not rcache_position.
transformers.py: forward(...) defines cache_position (lines 107 and 763), but the call at line 130 passes rcache_position=cache_position; rcache_position appears nowhere else — replace with cache_position=cache_position or verify the wrapped model actually expects rcache_position.
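The fix itself would be a one-keyword change at the call site, roughly as below (a sketch; the surrounding arguments are placeholders, not the file's exact code):

# Hypothetical call site inside forward(); only the keyword name matters here.
outputs = self.model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    cache_position=cache_position,  # was: rcache_position=cache_position
)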
🧹 Nitpick comments (1)
modelopt/torch/speculative/plugins/transformers.py (1)
692-701
: Use inference_mode for read-only blocks.
Replace torch.no_grad() with torch.inference_mode() where no tensors require grad to reduce dispatcher overhead and enable viewless inference.
Example:
- with torch.no_grad() if freeze_base_model else contextlib.nullcontext():
+ with torch.inference_mode() if freeze_base_model else contextlib.nullcontext():

And similarly for the later embed-only sections.
Also applies to: 842-853, 874-888, 922-941, 965-984
🔇 Additional comments (1)
modelopt/torch/speculative/plugins/transformers.py (1)
49-49
: Registry consolidation LGTM — verify no leftover Offline registry usages.
File: modelopt/torch/speculative/plugins/transformers.py — import now:
from ..eagle.conversion import EagleDMRegistry
The rg search you ran returned no output; absence of matches is inconclusive — re-run locally/CI:
rg -nP -C2 '\b(OfflineEagleDMRegistry|DetachedEagleGPTModel|DetachedHFEagleModel)\b' || true
rg -n -S -C2 'OfflineEagle|DetachedEagle|eagle.conversion|EagleDMRegistry' || true
Actionable comments posted: 0
♻️ Duplicate comments (1)
modelopt/torch/speculative/plugins/megatron_eagle.py (1)
1199-1211
: Fix offline inputs: allow EAGLE‑1, validate args, and centralize PP early‑return.
Offline forward still requires aux_hidden_states and places the PP early‑return only in the online branch. This blocks EAGLE‑1 offline and can NPE when aux is absent. Move the PP early‑return out of the if/else and only require aux when use_aux_hidden_state=True. Also select src_states accordingly.
Apply this diff:
@@
- if self.eagle_offline:
-     # aux_hidden_states and hidden_states are provided for offline eagle
-     # _base_model_forward is skipped
-     if return_eagle_inputs:
-         raise ValueError("return_eagle_inputs is unsupported in EAGLE offline mode.")
-     aux_hidden_states = kwargs.get("aux_hidden_states")
-     hidden_states = kwargs.get("hidden_states")
-     if aux_hidden_states is None or hidden_states is None:
-         raise ValueError(
-             "EAGLE offline mode requires kwargs: aux_hidden_states=[s,b,k*h], "
-             "hidden_states=[s,b,h]."
-         )
+ if self.eagle_offline:
+     # hidden_states (required) and aux_hidden_states (optional for EAGLE‑1) are provided for offline EAGLE.
+     # _base_model_forward is skipped
+     if return_eagle_inputs:
+         raise ValueError("return_eagle_inputs is unsupported in EAGLE offline mode.")
+     hidden_states = kwargs.get("hidden_states", None)
+     if hidden_states is None:
+         raise ValueError("EAGLE offline mode requires kwargs: hidden_states=[s,b,h].")
+     aux_hidden_states = kwargs.get("aux_hidden_states", None)
+     if self.eagle_config.use_aux_hidden_state and aux_hidden_states is None:
+         raise ValueError(
+             "EAGLE‑3 offline additionally requires aux_hidden_states=[s,b,k*h] when "
+             "use_aux_hidden_state=True."
+         )
  else:
@@
-     # Typically, this is only the case when PP > 1.
-     if not self.post_process:
-         return hidden_states
+ # Typically, this is only the case when PP > 1.
+ if not self.post_process:
+     return hidden_states
@@
- if self.eagle_offline:
-     eagle_module_input_hidden_states = self._get_eagle_input_hidden_states(
-         aux_hidden_states, apply_fc=self.eagle_config.use_aux_hidden_state
-     )
+ if self.eagle_offline:
+     src_states = (
+         aux_hidden_states if self.eagle_config.use_aux_hidden_state else hidden_states
+     )
+     eagle_module_input_hidden_states = self._get_eagle_input_hidden_states(
+         src_states, apply_fc=self.eagle_config.use_aux_hidden_state
+     )

Also applies to: 1226-1229, 1235-1238
🧹 Nitpick comments (6)
modelopt/torch/speculative/plugins/megatron_eagle.py (6)
1241-1256
: Preserve batch dims for return_eagle_inputs (avoid squeeze assumptions).
Squeezing dim 1 breaks when batch_size > 1. Return tensors with their batch dims intact.
- return {
-     "input_ids": input_ids.squeeze(0).cpu(),
-     "aux_hidden_states": eagle_module_input_hidden_states.squeeze(1).cpu(),
-     "hidden_states": hidden_states.squeeze(1).cpu(),
- }
+ return {
+     "input_ids": input_ids.cpu(),
+     "aux_hidden_states": eagle_module_input_hidden_states.cpu(),
+     "hidden_states": hidden_states.cpu(),
+ }
767-773
: Update comment: “detached eagle” wording is outdated.
This PR removes DetachedEagle; update the comment to reflect “offline EAGLE” semantics.
- # layer ids are not used in detached eagle, but we need to set this to have correct fc_input_size_multiplier
+ # Layer IDs are not used in offline EAGLE‑1, but we set them for correct fc_input_size_multiplier in EAGLE‑3.
1083-1085
: Make offline loss branch explicit (avoid shape‑heuristic).
Using labels.shape[1] < eagle_logits.shape[0] is brittle. Gate by eagle_offline and optionally assert expected alignment.
- elif labels.shape[1] < eagle_logits.shape[0]:
+ elif self.eagle_offline:
+     assert eagle_logits.shape[0] >= labels.shape[1] + 1, \
+         "Offline EAGLE expects logits length >= labels length + 1."
      token_loss = self.compute_language_model_loss(labels[:, 1:], eagle_logits[:-2, :, :])
1309-1311
: DRY the offline/online top‑1 slicing logic.
The repeated ternaries ([:-2] vs. [:-1]) are easy to drift. Factor a small helper to slice logits consistently.
+ def _slice_for_top1(self, logits):
+     return logits[:-2, :, :] if self.eagle_offline else logits[:-1, :, :]
@@
- gathered_logits = gather_from_tensor_model_parallel_region(
-     eagle_logits_0[:-2, :, :] if self.eagle_offline else eagle_logits_0[:-1, :, :]
- )
+ gathered_logits = gather_from_tensor_model_parallel_region(
+     self._slice_for_top1(eagle_logits_0)
+ )
@@
- gathered_logits = gather_from_tensor_model_parallel_region(
-     eagle_logits_1[1:-2, :, :] if self.eagle_offline else eagle_logits_1[1:-1, :, :]
- )
+ gathered_logits = gather_from_tensor_model_parallel_region(
+     self._slice_for_top1(eagle_logits_1[1:, :, :])
+ )
@@
- gathered_logits = gather_from_tensor_model_parallel_region(
-     eagle_logits_2[2:-2, :, :] if self.eagle_offline else eagle_logits_2[2:-1, :, :]
- )
+ gathered_logits = gather_from_tensor_model_parallel_region(
+     self._slice_for_top1(eagle_logits_2[2:, :, :])
+ )
@@
- gathered_logits = gather_from_tensor_model_parallel_region(
-     eagle_logits_3[3:-2, :, :] if self.eagle_offline else eagle_logits_3[3:-1, :, :]
- )
+ gathered_logits = gather_from_tensor_model_parallel_region(
+     self._slice_for_top1(eagle_logits_3[3:, :, :])
+ )

Also applies to: 1350-1351, 1392-1393, 1434-1435
846-866
: Online/Offline hidden state selection looks correct; small doc tweak.
Logic to bypass aux accumulation in offline is right. Consider noting in the docstring that offline passes either aux (EAGLE‑3) or base hidden_states (EAGLE‑1).
1251-1251
: Optional: avoid unnecessary full TP gather of base logits on return_eagle_inputs.
If the only consumer is offline precompute, gathering logits_sbh here is unnecessary. Returning CPU’d inputs without TP gather can reduce overhead.
If you keep it, please confirm a downstream consumer requires full‑vocab logits. Otherwise, drop this gather.
🔇 Additional comments (3)
modelopt/torch/speculative/plugins/megatron_eagle.py (3)
141-144
: Doc vs. implementation mismatch: “at least” should be >= not >.
- """Check if megatron-core is least this version.""" - return Version(megatron.core.__version__) > Version(target_version) + """Check if megatron-core is at least this version.""" + return Version(megatron.core.__version__) >= Version(target_version)
818-820
: Good: propagate draft_vocab_size/has_lm_head when reusing base decoder.
This keeps EagleModule consistent with the reduced draft vocab. 👍
748-751
: Disabling sequence_parallel in offline: confirm interaction with checkpoints and SP‑trained bases.
ripgrep shows gather_from_sequence_parallel_region / scatter_to_sequence_parallel_region are used throughout megatron_eagle.py but guarded by if self.config.sequence_parallel (examples: ~lines 901–904, 941–943, 961–964, 1245–1250, 1510–1511, 1547–1549, 1670–1671) — so forcing sequence_parallel=False will skip those calls, but this alone doesn't guarantee safety.
- Verify offline runs do not load SP‑sharded checkpoints or per‑rank weight shards (search for load_state_dict / load_checkpoint / state_dict / tensor_model_parallel loader code).
- Ensure no tensor‑sharded buffers are referenced unguarded during offline init; add an explicit assertion or a conversion/loader branch when eagle_offline=True to fail fast if an SP checkpoint is attempted to be loaded.
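A fail-fast guard in that spirit could look roughly like this (hypothetical; the checkpoint metadata field name is an assumption, not an existing API):

# Hypothetical guard at offline-EAGLE load time: refuse sequence-parallel checkpoints,
# since sequence_parallel is forcibly disabled in offline mode.
if self.eagle_offline and checkpoint_args.get("sequence_parallel", False):
    raise ValueError(
        "EAGLE offline mode disables sequence_parallel; "
        "load a non-SP checkpoint or re-shard it before offline training."
    )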
Actionable comments posted: 0
♻️ Duplicate comments (3)
modelopt/torch/speculative/plugins/megatron_eagle.py (3)
1199-1211
: Offline EAGLE-1 is blocked; make aux_hidden_states optional and validate by config.
Requiring aux_hidden_states unconditionally prevents offline EAGLE‑1. Gate the requirement on use_aux_hidden_state and always require hidden_states. Also reject return_eagle_inputs early (kept).
Apply:
  if self.eagle_offline:
      # aux_hidden_states and hidden_states are provided for offline eagle
      # _base_model_forward is skipped
      if return_eagle_inputs:
          raise ValueError("return_eagle_inputs is unsupported in EAGLE offline mode.")
-     aux_hidden_states = kwargs.get("aux_hidden_states")
-     hidden_states = kwargs.get("hidden_states")
-     if aux_hidden_states is None or hidden_states is None:
-         raise ValueError(
-             "EAGLE offline mode requires kwargs: aux_hidden_states=[s,b,k*h], "
-             "hidden_states=[s,b,h]."
-         )
+     hidden_states = kwargs.get("hidden_states", None)
+     if hidden_states is None:
+         raise ValueError("EAGLE offline mode requires kwargs: hidden_states=[s,b,h].")
+     aux_hidden_states = kwargs.get("aux_hidden_states", None)
+     if self.eagle_config.use_aux_hidden_state and aux_hidden_states is None:
+         raise ValueError(
+             "EAGLE‑3 offline additionally requires aux_hidden_states=[s,b,k*h] when "
+             "use_aux_hidden_state=True."
+         )
1226-1228
: Centralize PP early‑return so it runs for both online and offline paths.
Right now the early return is only reachable in the online branch. Move it out so offline PP does not proceed to output/loss.
- # Typically, this is only the case when PP > 1.
- if not self.post_process:
-     return hidden_states
+ # Typically, this is only the case when PP > 1.
+ if not self.post_process:
+     return hidden_states
1235-1238
: Offline EAGLE input should come from aux or base hidden by config.
Use aux when use_aux_hidden_state=True (EAGLE‑3), otherwise pass hidden_states (EAGLE‑1). Also set apply_fc accordingly.
- if self.eagle_offline:
-     eagle_module_input_hidden_states = self._get_eagle_input_hidden_states(
-         aux_hidden_states, apply_fc=self.eagle_config.use_aux_hidden_state
-     )
+ if self.eagle_offline:
+     src_states = (
+         aux_hidden_states
+         if self.eagle_config.use_aux_hidden_state
+         else hidden_states
+     )
+     eagle_module_input_hidden_states = self._get_eagle_input_hidden_states(
+         src_states, apply_fc=self.eagle_config.use_aux_hidden_state
+     )
🧹 Nitpick comments (4)
modelopt/torch/speculative/plugins/megatron_eagle.py (4)
1306-1321
: Guard accuracy reporting against empty slices (short sequences).
When seq_len is too short, slices like [:-2] can be empty → argmax and division by zero crash. Skip metrics if there is no token to score.
Apply to each block (1st→4th) similarly; example shown for the 1st block:
- with torch.no_grad():
-     gathered_logits = gather_from_tensor_model_parallel_region(
-         eagle_logits_0[:-2, :, :] if self.eagle_offline else eagle_logits_0[:-1, :, :]
-     )
-     eagle_top1 = gathered_logits.transpose(0, 1).argmax(dim=-1)
-     if self.eagle_config.draft_vocab_size != self.eagle_config.vocab_size:
-         eagle_top1 += self.eagle_module.d2t[eagle_top1]
-     top1_p = torch.eq(labels[:, 1:], eagle_top1).sum() / eagle_top1.numel()
-     acc.append(top1_p)
+ with torch.no_grad():
+     sl = eagle_logits_0.shape[0] - (2 if self.eagle_offline else 1)
+     if sl > 0:
+         logits_slice = (
+             eagle_logits_0[:sl, :, :]
+         )
+         gathered_logits = gather_from_tensor_model_parallel_region(logits_slice)
+         eagle_top1 = gathered_logits.transpose(0, 1).argmax(dim=-1)
+         if self.eagle_config.draft_vocab_size != self.eagle_config.vocab_size:
+             eagle_top1 += self.eagle_module.d2t[eagle_top1]
+         denom = max(1, eagle_top1.numel())
+         top1_p = torch.eq(labels[:, 1:], eagle_top1).sum() / denom
+         acc.append(top1_p)

Repeat the same shape guard for the 2nd, 3rd, and 4th accuracy blocks.
Also applies to: 1348-1363, 1390-1405, 1432-1447
1076-1086
: Loss shape handling: prefer explicit offline flag over shape heuristics.
Using labels.shape[1] < eagle_logits.shape[0] to infer offline mode is brittle. Gate the [:-2] slice with self.eagle_offline for clarity and safety.
- elif labels.shape[1] < eagle_logits.shape[0]:
-     token_loss = self.compute_language_model_loss(labels[:, 1:], eagle_logits[:-2, :, :])
- else:
-     token_loss = self.compute_language_model_loss(labels[:, 1:], eagle_logits[:-1, :, :])
+ else:
+     slice_ = eagle_logits[:-2, :, :] if self.eagle_offline else eagle_logits[:-1, :, :]
+     token_loss = self.compute_language_model_loss(labels[:, 1:], slice_)
768-771
: Stale terminology: “detached eagle” in comment.
Detached wrapper was removed; update the comment to “offline EAGLE”.
- # layer ids are not used in detached eagle, but we need to set this to have correct fc_input_size_multiplier
+ # Layer ids are not used in offline EAGLE, but we set them to derive the correct fc_input_size_multiplier.
640-647
: Comment contradicts code: clarify EAGLE‑1/‑3 inputs.
Code concatenates [embeddings, hidden_states] → 2h then fc for EAGLE‑1; EAGLE‑3 uses aux + specialized attention to reduce 2h→h. Update the comment.
- # EAGLE-1 uses [s, b, h] input but EAGLE-3 uses [s, b, 2h] input
+ # EAGLE‑1: concat [embeddings, hidden_states] → [s,b,2h], then fc → [s,b,h].
+ # EAGLE‑3: consumes aux/hidden; first attention reduces 2h → h (specialized qkv).
🔇 Additional comments (2)
modelopt/torch/speculative/plugins/megatron_eagle.py (2)
748-751
: Disabling sequence_parallel in offline mode looks correct.
This avoids ColumnParallelLinear sequence gathers that don't apply to precomputed inputs.
Confirm there’s no other SP‑dependent op executed in offline path (e.g., gathering/scattering around Eagle inputs).
818-820
: Draft‑vocab and has_lm_head propagation: LGTM.
Keeps EAGLE module output layer consistent with reduced vocab scenarios.
Need to run the regression test to see if there is any impact on the Megatron online training.
Actionable comments posted: 0
🧹 Nitpick comments (1)
modelopt/torch/export/plugins/megatron_importer.py (1)
515-521
: Harden softmax_offset gating to avoid AttributeError/KeyError
Good tightening of the None check. However, hasattr(attention.core_attention, ...) still dereferences attention.core_attention unconditionally and will raise if attention lacks that attribute. Also, indexing self.rules["softmax_offset"] can KeyError when the rule isn’t present for some arch. Guard both.
Apply this diff:
- if (
-     hasattr(attention.core_attention, "softmax_offset")
-     and attention.core_attention.softmax_offset is not None
- ):
-     self.rules["softmax_offset"](
-         attention.core_attention.softmax_offset, layer_id
-     )
+ core_attn = getattr(attention, "core_attention", None)
+ softmax_offset = getattr(core_attn, "softmax_offset", None)
+ softmax_rule = self.rules.get("softmax_offset")
+ if softmax_offset is not None and callable(softmax_rule):
+     softmax_rule(softmax_offset, layer_id)

Please confirm all non-MLA attention variants hit here always have core_attention (or this guard will be needed). Also confirm the rule exists for arches where softmax_offset is expected (no unexpected skips).
Actionable comments posted: 0
♻️ Duplicate comments (1)
modelopt/torch/speculative/plugins/megatron_eagle.py (1)
1194-1208
: Validate offline mode inputs and forbid return_eagle_inputs.
The offline forward path correctly validates required inputs, but there is a potential issue with the validation logic.
This matches the existing comment about validating required inputs and forbidding return_eagle_inputs in offline mode.
🧹 Nitpick comments (1)
modelopt/torch/speculative/plugins/megatron_eagle.py (1)
767-767
: Clarify the comment about offline mode.
The comment mentions that "layer ids are not used in offline eagle" but this could be more precise. EAGLE-3 still uses layer IDs internally for determining the FC input size multiplier, even in offline mode.
Apply this diff to improve the comment clarity:
- # layer ids are not used in offline eagle, but we need to set this to have correct fc_input_size_multiplier
+ # layer ids are not actively extracted in offline eagle, but we still need to set this to have correct fc_input_size_multiplier
🔇 Additional comments (8)
modelopt/torch/speculative/plugins/megatron_eagle.py (8)
818-820
: Configuration parameters propagated correctly.
The addition of draft_vocab_size and has_lm_head parameters to the eagle_config ensures proper configuration of the EagleModule for both online and offline modes.
847-866
: Offline/online mode handling in input hidden states.
The logic correctly handles both offline and online modes. For offline mode, it uses the provided aux_hidden_states directly; for online mode, it concatenates the aux_hidden_states collected from forward hooks.
1209-1222
: Online forward path preserved correctly.
The base model forward path is correctly preserved for online mode, with proper parameter passing to _base_model_forward.
1224-1226
: Pipeline parallel early return centralized.
The early return for non-post-process pipeline stages is now properly centralized and applies to both online and offline modes, which is correct.
1233-1236
: Offline eagle input processing.
The offline mode correctly processes aux_hidden_states using _get_eagle_input_hidden_states with the appropriate apply_fc parameter based on self.eagle_config.use_aux_hidden_state.
1238-1254
: Return eagle inputs logic preserved.
The logic for returning eagle inputs in online mode is preserved and correctly returns the expected dictionary format with input_ids, aux_hidden_states, and hidden_states.
1282-1293
: Labels padding for offline mode.
The logic handles the case where labels may be one token shorter than input_ids in offline mode by padding with zeros. This is a reasonable approach for offline training scenarios.
1344-1344
: No change required — labels-based slicing is correct.
The code pads offline labels when labels.shape[1] == input_ids.shape[1] - 1 (megatron_eagle.py ~1282–1286) and uses labels.shape[1] consistently for eagle_logits slicing across drafts (lines ~1301, 1344, 1386, 1428), so the dynamic slicing handles both padded and unpadded cases.
Actionable comments posted: 2
♻️ Duplicate comments (2)
modelopt/torch/speculative/plugins/megatron_eagle.py (2)
1197-1208
: Offline forward path: validate required inputs and forbid return_eagle_inputs.
- If eagle_offline=True, aux_hidden_states and hidden_states must be provided; currently missing keys lead to NoneType errors later.
- return_eagle_inputs should be rejected in offline mode to avoid ambiguous semantics.
1224-1226
: PP early return should be outside the online/offline branching.
The pipeline parallel early return (if not self.post_process) is only in the online path. This prevents offline mode from working correctly with pipeline parallelism. Move the PP check outside of the conditional to handle both paths:
      return_eagle_inputs=return_eagle_inputs,
  )
- # Typically, this is only the case when PP > 1.
- if not self.post_process:
-     return hidden_states
+ # Typically, this is only the case when PP > 1.
+ if not self.post_process:
+     return hidden_states
🧹 Nitpick comments (2)
modelopt/torch/speculative/plugins/megatron_eagle.py (2)
1344-1344
: Inconsistent indexing for eagle_logits extraction in offline mode.
The code uses logits_sbh.shape[0] for eagle_logits slicing, which differs between online and offline modes. Using negative indexing (-labels.shape[1]:) would be more consistent and less error-prone. Apply consistent negative indexing for all eagle_logits extractions:
- eagle_logits_1 = eagle_logits_2x[-labels.shape[1] :, :, :]
+ eagle_logits_1 = eagle_logits_2x[-labels.shape[1]:, :, :]
- eagle_logits_2 = eagle_logits_3x[-labels.shape[1] :, :, :]
+ eagle_logits_2 = eagle_logits_3x[-labels.shape[1]:, :, :]
- eagle_logits_3 = eagle_logits_4x[-labels.shape[1] :, :, :]
+ eagle_logits_3 = eagle_logits_4x[-labels.shape[1]:, :, :]
1282-1292
: Replace zero-padding with an ignore-index (or warn) when labels are 1 token short.
labels are padded with 0 when labels.shape[1] == input_ids.shape[1] - 1 (modelopt/torch/speculative/plugins/megatron_eagle.py:1282-1292), which will make the last token contribute to the loss if logit distillation is disabled. eagle_self_logit_distillation exists in the codebase (passed through config and asserted in the plugin), but the padding branch still runs in some cases — avoid training corruption by masking instead of zero-padding.
- Recommended fix (preferred): pad with the loss ignore index (e.g. -100) or the project’s IGNORE_TOKEN_ID:
right_token_pad = torch.full((labels.shape[0], 1), -100, dtype=labels.dtype, device=labels.device)
labels = torch.cat((labels, right_token_pad), dim=-1)

- Alternate (minimum): log a warning when padding happens and logit distillation is disabled:

if not self.eagle_self_logit_distillation:
    import warnings; warnings.warn("Label padding in offline mode may introduce training errors when logit distillation is disabled.")

- Location: modelopt/torch/speculative/plugins/megatron_eagle.py lines ~1282-1292 (padding) and ~736-761 (eagle_self_logit_distillation use/assert).
🔇 Additional comments (3)
modelopt/torch/speculative/eagle/conversion.py (2)
34-39
: LGTM! Dynamic registry registration handles offline mode elegantly.
The subclass detection and registration logic correctly enables dynamic modules that aren't explicitly registered to work with the Eagle registry, which is essential for handling offline mode models.
41-51
: All Eagle config fields are handled — no change required.
Base modify (modelopt/torch/speculative/eagle/eagle_model.py:29–51) assigns the passed params; plugins/transformers.py (371–418) and plugins/megatron_eagle.py (712–806, and usages throughout) consume and validate them (e.g., eagle_architecture_config → PretrainedConfig/from_dict and parallel/aux checks; eagle_freeze_base_model freezes params; eagle_loss_decay_factor used in loss scaling; eagle_self_logit_distillation validated against draft_vocab_size).
modelopt/torch/speculative/plugins/megatron_eagle.py (1)
846-865
: LGTM! Clear separation of online/offline logic in _get_eagle_input_hidden_states.
The method correctly handles both online and offline modes, with appropriate FC application based on the eagle_offline flag and aux_hidden_states availability.
eagle_config.use_mtp_layernorm = self.eagle_config.use_mtp_layernorm
eagle_config.draft_vocab_size = self.eagle_config.draft_vocab_size
eagle_config.has_lm_head = self.eagle_config.has_lm_head
🧩 Analysis chain
Ensure draft_vocab_size and has_lm_head are properly initialized.
The code passes draft_vocab_size and has_lm_head to the eagle_config, but these fields need to exist in the eagle_architecture_config.
🏁 Script executed:
#!/bin/bash
# Verify that draft_vocab_size and has_lm_head are present in eagle_architecture_config
rg -n "eagle_architecture_config.*draft_vocab_size|eagle_architecture_config.*has_lm_head" --type py
# Check for default values or initialization of these fields
ast-grep --pattern 'eagle_architecture_config.get("draft_vocab_size"$$$)'
ast-grep --pattern 'eagle_architecture_config.get("has_lm_head"$$$)'
Length of output: 329
Initialize draft_vocab_size and has_lm_head in eagle_architecture_config
modelopt/torch/speculative/plugins/megatron_eagle.py:817–819 assigns eagle_config.draft_vocab_size and eagle_config.has_lm_head, but those keys only appear in tests (tests/gpu/torch/speculative/plugins/test_speculative_megatron_modules.py:81,231). Add these fields (with sensible defaults) to the eagle_architecture_config definition or validate/guard before assigning to avoid runtime errors.
🤖 Prompt for AI Agents
In modelopt/torch/speculative/plugins/megatron_eagle.py around lines 817-819,
the code unconditionally assigns eagle_config.draft_vocab_size and
eagle_config.has_lm_head but those fields are not defined on
eagle_architecture_config and only referenced in tests; either add
draft_vocab_size (default 0 or None) and has_lm_head (default False) to the
eagle_architecture_config dataclass/definition where the config is declared, or
wrap these assignments with guards like "if hasattr(self.eagle_config,
'draft_vocab_size')/hasattr(self.eagle_config, 'has_lm_head')" before assigning
to avoid AttributeError at runtime—choose adding fields if they are intrinsic to
the model, otherwise guard the assignments and document the defaults.
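Following the guard option above, the assignments could be wrapped roughly like this (a sketch; the getattr defaults mirror the suggestion and should be revisited if the fields are intrinsic to the model):

# Hypothetical guarded assignment so missing config fields don't raise at runtime.
eagle_config.draft_vocab_size = getattr(self.eagle_config, "draft_vocab_size", None)
eagle_config.has_lm_head = getattr(self.eagle_config, "has_lm_head", False)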
Left a question.
LGTM;
What does this PR do?
Type of change: refactor
Overview:
Following the HF implementation, remove the dedicated Detached Eagle class and handle the offline case in the same class as the online mode.
Usage
# Add a code snippet demonstrating how to use this
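A minimal sketch of what usage might look like after this change (the function and config-key names follow the conversion flow summarized in the sequence diagram above; treat them as assumptions, not the exact public API):

from modelopt.torch.speculative.eagle.conversion import convert_to_eagle_model

# Hypothetical config: keys mirror the options passed through modify() in the
# conversion flow; eagle_offline selects the precomputed-hidden-states path.
eagle_config = {
    "eagle_offline": True,
    "eagle_self_logit_distillation": True,
    "eagle_freeze_base_model": True,
    "eagle_architecture_config": {},  # draft-module architecture settings (elided)
}

model = convert_to_eagle_model(model, eagle_config)

# Offline training then feeds precomputed states instead of running the base model:
# model(input_ids, labels=labels, hidden_states=hs, aux_hidden_states=aux_hs)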
Testing
Tested with both online and offline Qwen3 30B for EAGLE-3 training.
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
Refactor